spatial intelligence
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Wu, Haoning, Huang, Xiao, Chen, Yaohui, Zhang, Ya, Wang, Yanfeng, Xie, Weidi
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 40 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.
- Asia > China > Shanghai > Shanghai (0.04)
- South America > Brazil (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Inside Fei-Fei Li's Plan to Build AI-Powered Virtual Worlds
Pillay is an editorial fellow at TIME. Recent AI progress has followed a pattern. Across text, image, audio, and video, once the right technical foundations were discovered, it only took a few years for AI-generated outputs to go from merely passable to indistinguishable from human creation. Although it's early, recent advances suggest that virtual worlds--3D environments you can explore and interact with--could be next.
- North America > Canada > Ontario > Middlesex County > London (0.05)
- North America > United States (0.05)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Human Computer Interaction > Interfaces (0.75)
- Information Technology > Artificial Intelligence > Natural Language (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Xu, Peiran, Wang, Sudong, Zhu, Yao, Li, Jianing, Zhang, Yunjian
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Lian, Shijie, Wu, Changti, Yang, Laurence Tianruo, Yuan, Hang, Yu, Bin, Zhang, Lei, Chen, Kai
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3-72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2%), and the mean MindCube accuracy rose from 31.4% to 38.1% (+6.7%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and the Euclid30K dataset can be found at https://zgca-ai4edu.github.io/Euclids_Gift.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
Yu, Songsong, Chen, Yuxin, Ju, Hao, Jia, Lianjie, Zhang, Fuxi, Huang, Shaofei, Wu, Yuhan, Cui, Rundi, Ran, Binghao, Zhang, Zaibin, Zheng, Zhedong, Zhang, Zhipeng, Wang, Yifan, Song, Lin, Wang, Lijun, Li, Yanwei, Shan, Ying, Lu, Huchuan
Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.
- North America > United States (0.14)
- Asia > Macao (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Research Report (1.00)
- Overview (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Spatially Intelligent Patrol Routes for Concealed Emitter Localization by Robot Swarms
Morris, Adam, Pelham, Timothy, Hunt, Edmund R.
This paper introduces a method for designing spatially intelligent robot swarm behaviors to localize concealed radio emitters. We use differential evolution to generate geometric patrol routes that localize unknown signals independently of emitter parameters, a key challenge in electromagnetic surveillance. Patrol shape and antenna type are shown to influence information gain, which in turn determines the effective triangulation coverage. We simulate a four-robot swarm across eight configurations, assigning pre-generated patrol routes based on a specified patrol shape and sensing capability (antenna type: omnidirectional or directional). An emitter is placed within the map for each trial, with randomized position, transmission power, and frequency. Results show that omnidirectional localization success rates are driven primarily by source location rather than signal properties, with failures occurring most often when sources are placed in peripheral areas of the map. Directional antennas are able to overcome this limitation due to their higher gain and directivity, with an average detection success rate of 98.75% compared to 80.25% for omnidirectional. Average localization errors range from 1.01-1.30 m for directional sensing and 1.67-1.90 m for omnidirectional sensing, and directional sensing also benefits from shorter patrol edges. These results demonstrate that a swarm's ability to predict electromagnetic phenomena is directly dependent on its physical interaction with the environment. Consequently, spatial intelligence, realized here through optimized patrol routes and antenna selection, is a critical design consideration for effective robotic surveillance.
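The abstract above mentions triangulation but does not say how the swarm computes it. As a rough illustration only, here is a minimal sketch of generic bearing-only triangulation, assuming each robot with a directional antenna reports its own position and a bearing toward the emitter. The function name `triangulate` and the observation format are hypothetical and are not taken from the paper.

```python
import math


def triangulate(observations):
    """Least-squares intersection of bearing lines (illustrative sketch).

    observations: list of ((x, y), bearing_rad) pairs, where the bearing is
    measured counter-clockwise from the +x axis. This interface is an
    assumption made for the example, not the paper's.
    """
    s_xx = s_xy = s_yy = b_x = b_y = 0.0
    for (px, py), theta in observations:
        # Unit normal to the bearing line; any point q on the line
        # satisfies n . q = n . p.
        nx, ny = -math.sin(theta), math.cos(theta)
        d = nx * px + ny * py
        # Accumulate the 2x2 normal equations (A^T A) q = A^T b.
        s_xx += nx * nx
        s_xy += nx * ny
        s_yy += ny * ny
        b_x += nx * d
        b_y += ny * d
    det = s_xx * s_yy - s_xy * s_xy
    if abs(det) < 1e-12:
        raise ValueError("bearings are (nearly) parallel; position is not observable")
    x = (s_yy * b_x - s_xy * b_y) / det
    y = (s_xx * b_y - s_xy * b_x) / det
    return x, y


# Four robots at the corners of a 10 m square, all sighting the same emitter.
emitter = (3.0, 4.0)
robots = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
obs = [(p, math.atan2(emitter[1] - p[1], emitter[0] - p[0])) for p in robots]
estimate = triangulate(obs)
```

With noise-free bearings the estimate coincides with the emitter position up to floating-point error. A single bearing, or several parallel ones, leaves the position unobservable, which is one intuition for why patrol geometry affects triangulation coverage.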
NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Yang, Haolin, Long, Yuxing, Yu, Zhuoyuan, Yang, Zihan, Wang, Minghan, Xu, Jiapeng, Wang, Yihan, Yu, Ziyan, Cai, Wenzhe, Kang, Lei, Dong, Hao
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Vision (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Blueprint-Bench: Comparing spatial intelligence of LLMs, agents and image models
Petersson, Lukas, Backlund, Axel, Wennstöm, Axel, Petersson, Hanna, Sharrock, Callum, Dabiri, Arash
We introduce Blueprint-Bench, a benchmark designed to evaluate spatial reasoning capabilities in AI models through the task of converting apartment photographs into accurate 2D floor plans. While the input modality (photographs) is well within the training distribution of modern multimodal models, the task of spatial reconstruction requires genuine spatial intelligence: inferring room layouts, understanding connectivity, and maintaining consistent scale. We evaluate leading language models (GPT-5, Claude 4 Opus, Gemini 2.5 Pro, Grok-4), image generation models (GPT-Image, NanoBanana), and agent systems (Codex CLI, Claude Code) on a dataset of 50 apartments with approximately 20 interior images each. Our scoring algorithm measures similarity between generated and ground-truth floor plans based on room connectivity graphs and size rankings. Results reveal a significant blind spot in current AI capabilities: most models perform at or below a random baseline, while human performance remains substantially superior. Image generation models particularly struggle with instruction following, while agent-based approaches with iterative refinement capabilities show no meaningful improvement over single-pass generation. Blueprint-Bench provides the first numerical framework for comparing spatial intelligence across different model architectures. We will continue evaluating new models as they are released and welcome community submissions, monitoring for the emergence of spatial intelligence in generalist AI systems.
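The abstract above says the score compares room connectivity graphs and size rankings, without giving the formula. The following is a minimal sketch of one way such a comparison could work, assuming Jaccard overlap of connectivity edges and pairwise agreement of room-size ordering, combined with an arbitrary equal weighting. All function names, the input format, and the weighting are hypothetical; this is not Blueprint-Bench's actual scoring algorithm.

```python
from itertools import combinations


def edge_jaccard(edges_a, edges_b):
    """Jaccard overlap of two undirected room-connectivity edge sets."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def size_rank_agreement(sizes_a, sizes_b):
    """Fraction of room pairs ordered the same way by size in both plans."""
    shared = sorted(set(sizes_a) & set(sizes_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    concordant = sum(
        1
        for r, s in pairs
        if (sizes_a[r] - sizes_a[s]) * (sizes_b[r] - sizes_b[s]) > 0
    )
    return concordant / len(pairs)


def floor_plan_similarity(pred, truth, w_graph=0.5):
    """Weighted mix of the two terms; the 0.5 weight is an arbitrary assumption."""
    g = edge_jaccard(pred["edges"], truth["edges"])
    r = size_rank_agreement(pred["sizes"], truth["sizes"])
    return w_graph * g + (1.0 - w_graph) * r


truth = {
    "edges": [("hall", "kitchen"), ("hall", "bedroom"), ("hall", "bath")],
    "sizes": {"hall": 8, "kitchen": 12, "bedroom": 15, "bath": 5},
}
pred = {
    "edges": [("hall", "kitchen"), ("kitchen", "bedroom"), ("hall", "bath")],
    "sizes": {"hall": 9, "kitchen": 10, "bedroom": 20, "bath": 12},
}
score = floor_plan_similarity(pred, truth)
```

In this toy case two of the four distinct edges match (Jaccard 0.5) and four of the six room pairs keep their size ordering, so the combined score is about 0.58. Comparing rankings rather than absolute areas keeps the score insensitive to the overall scale of the drawing.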
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents
Wei, Ziming, Lin, Bingqian, Jiao, Zijian, Nie, Yunshuang, Ma, Liang, Liu, Yuecheng, Zhuang, Yuzheng, Liang, Xiaodan
Spatial planning is a crucial part of spatial intelligence, requiring the understanding and planning of object arrangements in space. AI agents with spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, and urban planning. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning in typical Visual Question-Answering (VQA) form, which suffers from a gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open-world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi-modal human instructions. It involves 4,000 curated spatial planning tasks and also provides a paradigm for infinitely expandable data collection by utilizing rich player-generated content. MineAnyBuild evaluates spatial planning through four core supporting dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Based on MineAnyBuild, we perform a comprehensive evaluation of existing MLLM-based agents, revealing the severe limitations but enormous potential in their spatial planning abilities. We believe MineAnyBuild will open new avenues for the evaluation of spatial intelligence and help promote further development of open-world AI agents capable of spatial planning.
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Information Technology (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Yang, Sihan, Xu, Runsen, Xie, Yiman, Yang, Sizhe, Li, Mo, Lin, Jingli, Zhu, Chenming, Chen, Xiaochen, Duan, Haodong, Yue, Xiangyu, Lin, Dahua, Wang, Tai, Pang, Jiangmiao
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > North Dakota > Burke County (0.04)
- Information Technology (0.46)
- Law (0.46)